Abstract

Conditional mutual information is important in the selection and interpretation
of graphical models. Its empirical version is well known as a generalised
likelihood ratio test, and it may be represented as a difference in entropy. We
consider the forward difference expansion of the entropy function defined on all
subsets of the variables under study. The elements of this expansion are invariant
to permutation of their suffices and relate higher order mutual informations
to lower order ones. The third order difference is expressible as an apparently
asymmetric difference between a marginal and a conditional mutual information.
Its role in the decomposition for explained information provides a technical
definition for synergy between three random variables. Positive values occur
when two variables provide alternative explanations for a third; negative values,
termed synergies, occur when the explained information is greater than
the sum of its parts. Synergies tend to be infrequent; they connect the seemingly
unrelated concepts of suppressor variables in regression, on the one hand, and
unshielded colliders in Bayes networks (immoralities), on the other. We give novel
characterizations of these phenomena that generalise to categorical variables and
to higher dimensions. We propose an algorithm for systematically computing low
order differences from a given graph. Examples from small scale real-life studies
indicate the potential of these techniques for empirical statistical analysis.

Keywords: Bayes network; Conditional mutual information; Imsets; Möbius inversion;
Suppressor variable; Unshielded collider.


Introduction 


The independence of two random variables is denoted by $X_1 \perp\!\!\!\perp X_2$, Dawid (1979),
and the conditional independence of these two, given a third, by $X_1 \perp\!\!\!\perp X_2 \mid X_3$. The
marginal mutual information of two random variables and the conditional mutual information
of two variables given a third are, in terms of the joint probability density
or mass function,

$$I_{12} = \mathrm{inf}(X_1 \perp\!\!\!\perp X_2) = E \log \frac{f_{12}}{f_1 f_2} \quad \text{and} \quad I_{12|3} = \mathrm{inf}(X_1 \perp\!\!\!\perp X_2 \mid X_3) = E \log \frac{f_{123} f_3}{f_{13} f_{23}}. \qquad (1)$$

The difference in these measures is

$$\mathrm{inf}(X_1 \perp\!\!\!\perp X_2) - \mathrm{inf}(X_1 \perp\!\!\!\perp X_2 \mid X_3) = -E \log \frac{f_{123} f_1 f_2 f_3}{f_{12} f_{13} f_{23}}. \qquad (2)$$

The key property is that the right hand side is symmetric to any permutation of suffices
1, 2, 3 even though the left does not appear to be. Define $\delta_{123}$ by the right hand side
expression.

Conditional independence and mutual information lie at the foundations of graphical
models, for texts see Koller and Friedman (2009); Lauritzen (1996); Whittaker (1990).
The seminal citation for the separation properties of undirected graphical models is
Darroch et al. (1980). Pearl (1988) made the big step in establishing acyclic directed
graphs and the concept of d-separation for Bayes networks. In the class of these directed
graphs certain subsets are probabilistically (Markov) equivalent, and an important
theorem is that the skeleton and its unshielded colliders (or immoralities) specify the
equivalence classes. The criterion for this and the generalisation to chain graphs was
independently established by Verma and Pearl (1990) and Frydenberg (1990), and later
developed by Andersson et al. (1997).

Suppressor effects in multiple regression were first elucidated by Horst (1941). The
phenomenon arises when the dependent variable has a smaller prediction error by
including an additional explanatory variable that has no (or little) explanatory effect
when used by itself; often this is manifest in enhanced regression coefficients. There is
a social science literature concentrated in educational and psychological testing theory
that has an interest in suppression because of its concern to design experiments that
make predictions more precise. Ludlow and Klein (2014) gives a substantial review
of this area, and so we just mention a few other references: McNemar (1945), Voyer
(1996), Maassen and Bakker (2001), Shieh (2006). The well known structural equations
text Kline (2011) cites suppression as one of the fundamental concepts of
regression.

The technical literature on alternative ways to define and explain suppression includes
Velicer (1978), Bertrand and Holder (1988), Smith et al. (1992), MacKinnon et al. (2000),
Shieh (2001). More recently Friedman and Wall (2005) give a survey and point out
that the term synergism follows a suggestion of Hamilton (1988). This literature
distinguishes several types of suppression. Classical suppression: $X_2$, say, is the suppressor
variable; it is uncorrelated (or nearly so) with $Y$, but adds to the predictive power of
the regression of $Y$ on $X_1$ when both are included. Negative suppression: both $X_1, X_2$ have
positive zero-order correlations with $Y$, and correlate positively with each other, but
one has a negative coefficient in the regression of $Y$ on both. Reciprocal suppression:
both variables are marginally good predictors of $Y$, but are negatively correlated.


There are several seemingly different indicators of suppression, which are varyingly
described by conditions on the correlations (marginal, multiple, partial), or in terms
of regression and correlation coefficients, or in terms of explained variance, or even in
terms of a rather confusing semi-partial correlation introduced by Velicer. All authors
give conditions for the three variable regression scenario, some attempt to generalise to
p variables, and some explanations are geometric; most however reduce to conditions on
correlation coefficients. That suppression is usually presented as a three dimensional
correlation phenomenon does not make clear how to measure its strength, how to
generalise to higher dimensions, or how to generalise to other distributions.


Our contribution is to show that the 3rd-order forward difference of (2) relates the seemingly
unrelated topics of immorality and suppression in a natural way. The condition for
suppression is that $\delta_{123} < 0$; noting the phrase 'the whole regression can be greater
than the sum of its parts' in the title of Bertrand and Holder (1988) suggests that
synergy is a good synonym for the triple 123. The condition for an unshielded collider
(immorality) at 3 is that $\delta_{123} < 0$ and $\delta_{12} = 0$.


To set this within a wider framework we write down the forward difference expansion for
the entropy function, and use Möbius inversion to calculate the differences given the
entropies. All forward differences are invariant to permutation of their suffices. Marginal
mutual informations are second order differences and conditional measures have additive
expressions in terms of the second and higher order forward differences. Higher
order differences are made more tractable by defining conditional forward differences.


We interpret the negative third order forward differences as synergies. Classic examples
of graphical models in low dimensions illustrate the role of forward differences in
interpretation of the model. A computing scheme for 3rd-order elements from a given
graph based on cluster nodes is used to investigate empirical data for synergies.


The forward differences of the entropy provide a wider framework to explore suppression
and immorality. This setting explains why the essence of both phenomena
concerns exactly three variables, and why suppression is symmetric. It distinguishes
suppression from both mediation and confounding where $\delta$ is positive. It generalises
the notion of suppressor variables to higher dimensions and to other distributions, for
instance to categorical data. It gives an alternative characterization of immoralities
(unshielded colliders).

Plan of the paper: In Section 2 we define the forward difference expansion of the entropy
function and elaborate its properties. In Section 3 we make the connection to suppressor
variables in regression and immoralities in Bayes networks, and give alternative
characterizations of these phenomena. In Section 4 we consider more detailed applications
to categorical and continuous data and examples from small scale empirical
studies. Proofs are collected in the Appendix. An algorithm for systematically computing
low order differences from a given graph is provided in Supplementary Material.


2 Forward differences of the entropy 
2.1 Preliminaries 

The nodes in $P = \{1, 2, \ldots, p\}$ correspond to random variables $X_1, X_2, \ldots, X_p$ having
a joint distribution. For subsets $A, B, C$ in the power set $\mathcal{P} = \{\emptyset, \{1\}, \{2\}, \ldots, P\}$,
conditional independence statements of the form $X_A \perp\!\!\!\perp X_B \mid X_C$, where $X_A$ refers to the
vector $(X_i, i \in A)$, simplify the dependency structure of $X = X_P$.


The entropy function, $h$, is defined on $\mathcal{P}$ by $h_A = -E \log f_A(X_A)$, where $f$ is the
derivative of the joint probability measure. Without loss of generality we assume this
is always well defined for, if not, we may replace it by the (negative) relative entropy
$-E \log f_A(X_A) / \prod_{i \in A} f_i(X_i)$, termed the multi-information by Studený (2005). For
a $p$-dimensional mass point distribution the entropy is $h_A = -\sum p_A(x_A) \log p_A(x_A)$
where $p_A$ is the mass function on the margin determined by $A$. This is always
non-negative. For a $p$-dimensional multivariate Normal distribution with mean zero and
correlation matrix $\Sigma$ the entropy is $h_A = \frac{1}{2} \log \det(\Sigma_{AA})$. Any additive term, constant
with respect to $A$, may be ignored since our concern is with entropy differences. Note
that $h_\emptyset = 0$ and that the notation $h_A$ presumes invariance to any permutation of the
subscripts, justified because the underlying distribution is invariant.


In reporting numerical values of the entropy, or more usually differences in entropy, $h$
is scaled to millibits by multiplying by the factor $2^{10}/\log(2)$. The upper limit for the
mutual information against independence for two binary variables with equi-probable
margins is 1024 mbits, attained when the variables always take the same value. For
two Gaussian variables with correlation 0.5 the measure is 212.5 mbits, but there is no
upper limit.
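The two reference values quoted above can be checked with a short sketch. It assumes only the scaling factor stated in the text and the Gaussian entropy $h_A = \frac{1}{2}\log\det(\Sigma_{AA})$; the function names are illustrative and not part of the paper.

```python
import math

MBIT = 2 ** 10 / math.log(2)  # scaling factor from nats to the paper's millibits


def gaussian_mi_mbits(rho):
    """Mutual information of two Gaussian variables with correlation rho."""
    # I_12 = h_1 + h_2 - h_12 = -(1/2) log(1 - rho^2) for a correlation matrix
    return -0.5 * math.log(1.0 - rho ** 2) * MBIT


def identical_binary_mi_mbits():
    """Mutual information of two equi-probable binary variables that always agree."""
    # I_12 = h_1 + h_2 - h_12 = log 2 + log 2 - log 2 = log 2 nats
    return math.log(2) * MBIT


print(round(gaussian_mi_mbits(0.5), 1))    # 212.5
print(round(identical_binary_mi_mbits()))  # 1024
```

As the correlation tends to one, `gaussian_mi_mbits` grows without bound, which is the sense in which the Gaussian measure has no upper limit.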


For disjoint sets $A, B, C \in \mathcal{P}$ the conditional mutual information is

$$I_{AB|C} = -h_{A \cup B \cup C} + h_{A \cup C} + h_{B \cup C} - h_C. \qquad (3)$$

It is useful to retain both the inf and $I$ notations for this measure. The marginal
information is $I_{ij}$ where the conditioning set is empty. We require the well known
lemma that

$$I_{AB|C} = 0 \iff X_A \perp\!\!\!\perp X_B \mid X_C. \qquad (4)$$

The proof uses the non-negativity of the Kullback-Leibler divergence between the joint
distribution and the distribution factorised according to the independence statement.


2.2 Entropy function expansion 


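The entropy function $h$ is defined on the power set of the nodes, $\{h_A; A \in \mathcal{P}\}$. The
forward differences $\{\delta_A; A \in \mathcal{P}\}$ of the entropy are defined by the additivity relations

$$h_A = \sum_{B \subseteq A} \delta_B \quad \text{for } A \in \mathcal{P}. \qquad (5)$$

Solving by Möbius inversion, Rota (1964), gives

$$\delta_A = \sum_{B \subseteq A} (-1)^{|A| - |B|} h_B \quad \text{for } A \in \mathcal{P}. \qquad (6)$$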
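A minimal sketch of the inversion (6) and the additivity relation (5) follows. The entropy values are hypothetical numbers chosen only for illustration, and the function names are assumptions rather than anything defined in the paper.

```python
from itertools import combinations


def subsets(s):
    """All subsets of s, as frozensets, including the empty set."""
    items = sorted(s)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]


def forward_differences(h):
    """Inversion (6): delta_A = sum over B within A of (-1)^(|A|-|B|) h_B."""
    return {A: sum((-1) ** (len(A) - len(B)) * h[B] for B in subsets(A))
            for A in h}


def entropies_from_differences(delta):
    """Additivity (5): h_A = sum over B within A of delta_B."""
    return {A: sum(delta[B] for B in subsets(A)) for A in delta}


# Hypothetical entropies on three nodes (illustrative values only).
values = {(): 0.0, (1,): 1.0, (2,): 1.0, (3,): 1.0,
          (1, 2): 1.8, (1, 3): 1.5, (2, 3): 1.9, (1, 2, 3): 2.4}
h = {frozenset(k): v for k, v in values.items()}

delta = forward_differences(h)
h_back = entropies_from_differences(delta)
print(delta[frozenset({1, 2})], delta[frozenset({1, 2, 3})])
```

Because the dictionary is keyed by `frozenset`, the invariance of each difference to permutation of its indices is built into the data structure.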
Theorem 2.2.1 Symmetry of the forward differences. The forward differences are symmetric,
that is, any $\delta_A$ is invariant to any permutation of the indices within the set
$A \in \mathcal{P}$.

A detailed proof is given in the Appendix. 

We need the following lemma in a later section. 

Lemma 2.1 Additivity. When the entropy is additive, so that $h_{a \cup b} = h_a + h_b$ for all
$a \subseteq A$, $b \subseteq B$, non-empty $a$ and $b$, and disjoint $A, B \subseteq P$, then $\delta_{a \cup b} = 0$. The converse
also holds.

The proof, given in the Appendix, essentially invokes a triangular elimination scheme. 


2.3 Conditional mutual information and forward differences 

From (3) the conditional mutual information of two random variables $X_i, X_j$ given a
subset of others, $X_A$, is

$$I_{ij|A} = -h_{Aij} + h_{Ai} + h_{Aj} - h_A \qquad (7)$$

where $Aij$ is shorthand for $A \cup \{i, j\}$. The right hand side is the elementary imset
representation, for pairwise conditional independence, Studený (2005). It is the scalar
product of the entropy function with the imset (integer valued multi-set) $(\ldots, -1, 1, 1, -1, \ldots)$
of length $2^p$ that has zeros in the appropriate places. It is elementary because it
represents a single conditional independence statement.


Theorem 2.3.1 Conditional mutual information and forward differences. The conditional
mutual information can be expressed in terms of the forward differences, $\{\delta\}$, of
the entropy function by

$$I_{ij|A} = -\sum_{ij \subseteq B \subseteq Aij} \delta_B \quad \text{for } i, j \in P,\ A \in \mathcal{P}. \qquad (8)$$

The subset $\{i, j\}$ occurs in every term on the right of (8). The first term is the marginal
mutual information $I_{ij}$. Each $\delta$ term on the right is invariant to permutation of its
suffices. If the conditioning set $A$ is of moderate size then there are only a moderate
number of terms in the summation.

Corollary 2.3.1 Third order forward differences. When $A = \{k\}$ consists of a single
element

$$\delta_{ijk} = I_{ij} - I_{ij|k}, \quad \text{and} \qquad (9)$$

$$\delta_{ijk} = h_{ijk} - h_{ij} - h_{ik} - h_{jk} + h_i + h_j + h_k. \qquad (10)$$

This follows because setting $A = \emptyset$ in (8) gives $I_{ij} = -\delta_{ij}$, and $A = \{k\}$ gives $I_{ij|k} = -(\delta_{ij} + \delta_{ijk})$.
Subtraction gives (9). The second statement is just the inversion formula (6) for $\delta_{ijk}$.

This corollary locates the identity introduced at (2) within a wider framework. The
key property is that the difference $\delta$ is symmetric in permutation of suffices $i, j, k$, as in (10),
while intuitively the right hand side of (9) is not.
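The symmetry can be checked numerically on a small discrete example. The sketch below uses a hypothetical joint mass function on three binary variables (the weights are illustrative, not data from the paper), computes $\delta_{ijk}$ from the entropies, and compares it with the marginal minus conditional mutual information for each of the three choices of conditioning variable.

```python
import math
from itertools import product

# Hypothetical joint pmf on {0,1}^3 over (X_i, X_j, X_k); illustrative weights.
weights = [0.20, 0.05, 0.10, 0.15, 0.05, 0.20, 0.15, 0.10]
pmf = dict(zip(product((0, 1), repeat=3), weights))


def entropy(pmf, keep):
    """Entropy (in nats) of the margin on the coordinates listed in keep."""
    margin = {}
    for x, p in pmf.items():
        key = tuple(x[a] for a in keep)
        margin[key] = margin.get(key, 0.0) + p
    return -sum(p * math.log(p) for p in margin.values() if p > 0)


def cond_info(pmf, i, j, cond=()):
    """I_{ij|cond} = -h_{cond,i,j} + h_{cond,i} + h_{cond,j} - h_{cond}."""
    c = tuple(cond)
    return (-entropy(pmf, c + (i, j)) + entropy(pmf, c + (i,))
            + entropy(pmf, c + (j,)) - entropy(pmf, c))


def delta3(pmf):
    """Third order forward difference from the alternating sum of entropies."""
    def h(*keep):
        return entropy(pmf, keep)
    return h(0, 1, 2) - h(0, 1) - h(0, 2) - h(1, 2) + h(0) + h(1) + h(2)


d = delta3(pmf)
# Marginal minus conditional information for each choice of conditioning variable.
by_pair = [cond_info(pmf, a, b) - cond_info(pmf, a, b, (c,))
           for a, b, c in [(0, 1, 2), (0, 2, 1), (1, 2, 0)]]
print(d, by_pair)
```

All three entries of `by_pair` agree with `d` up to rounding error, although each is computed from a different pair of variables.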

2.4 Forward differences of the conditional entropy 

The conditional entropy function $\{h_{A|B}; A \in \mathcal{P}(P \setminus B)\}$ is defined on the restricted
power set that excludes $B$, where $h_{A|B} = -E \log f_{A|B}(X_A \mid X_B)$. The corresponding
conditional forward differences are defined by (5) and (6), giving $\{\delta_{A|B}; A \in \mathcal{P}(P \setminus B)\}$.

The set notation in $\delta_{A|B}$ makes evident the symmetry of the differences.

Theorem 2.4.1 A recursion for conditional forward differences. For $k \in P$, $B \subseteq P \setminus k$
and $A \in \mathcal{P}(P \setminus (B \cup k))$ the conditional forward differences satisfy

$$\delta_{A|Bk} = \delta_{Ak|B} + \delta_{A|B}. \qquad (11)$$

When $B$ is empty, the identity (11) shows that the higher order forward difference is
the difference between a conditional and a marginal forward difference:

$$\delta_{Ak} = \delta_{A|k} - \delta_A.$$

The size of the higher order term is useful in assessing how much $\delta_A$ might change
by conditioning on a further variable. This is invariant to permutation of the set
$Ak = A \cup \{k\}$. To illustrate with $|A| = 3$, $\delta_{1234} = \delta_{123|4} - \delta_{123} = \delta_{124|3} - \delta_{124}$ and so
on.

The identity (11) generalises to express a conditional forward difference as sums of
conditional forward differences conditioning on a lower order:

$$\delta_{A|B \cup C} = \sum_{D \subseteq C} \delta_{A \cup D|B}.$$


Theorem 2.4.2 Separation and the forward difference. Whenever $C$ separates $A$ and
$B$ in the conditional independence graph, $\delta_{A \cup B|C} = 0$.

The value of this result is that it allows easy interpretations of marginal forward
differences in examples. There is a converse to this theorem if the condition on the
conditional forward differences is strengthened.

2.5 Non-collapsibility of mutual information


Collapsibility is important in statistical inference because it elucidates which properties
of a joint distribution can be inferred from a margin. Simpson's paradox, Simpson
(1951), refers to a violation of collapsibility; other references are Bishop et al. (1975),
Whittemore (1978), Whittaker (1990), Greenland et al. (1999), among others.


Consider three variables with $I_{ik|j} = 0$ and corresponding independence graph
with one missing edge. The strength of the relationship between $X_i$ and $X_j$ is measured
in two dimensions by $I_{ij}$ and in three dimensions by $I_{ij|k}$. If $I$ were collapsible then
$I_{ij} - I_{ij|k} = 0$. But this difference is $-\delta_{ij} + \delta_{ij|k} = \delta_{ijk}$ by (11), the 3rd-order difference.
By symmetry $\delta_{ijk}$ is also equal to $\delta_{ik|j} - \delta_{ik}$, and so $\delta_{ijk} = 0$ together with $\delta_{ik|j} = 0$ would
imply $\delta_{ik} = 0$, which is false in general. The premiss that the measures are equal is
untenable.

Large values of $\delta_{ijk}$ indicate that conditioning on $X_k$ modifies the strength of the
relationship between $X_i$ and $X_j$; even though it is a symmetric measure this does not
imply that the subgraph be complete.

More generally requiring the collapsibility of $\delta_A$ in the space $A \cup B$ requires $\delta_{A|B} = \delta_A$;
by Theorem 2.4.2 this is equivalent to $X_A \perp\!\!\!\perp X_B$.


3 Synergy, suppression and immorality 
3.1 Synergy 

The information against the independence of two variables is synonymous with the 
information explained in one variable by predicting from the other. 

Theorem 3.1.1 Explained information. The explained information in one variable
expressed in terms of the marginal mutual information of others is

$$\mathrm{inf}(X_k \perp\!\!\!\perp X_A) = \sum_{i \in A} \mathrm{inf}(X_k \perp\!\!\!\perp X_i) - \sum_{B \subseteq A,\ |B| > 1} \delta_{B \cup k} \qquad (12)$$

where the last summation is over subsets $B$ with at least 2 elements. In particular

$$\mathrm{inf}(X_k \perp\!\!\!\perp (X_i, X_j)) = \mathrm{inf}(X_k \perp\!\!\!\perp X_i) + \mathrm{inf}(X_k \perp\!\!\!\perp X_j) - \delta_{ijk}. \qquad (13)$$

The proof is included in the Appendix. When $\delta_{ijk} < 0$ the triple $\{i, j, k\}$ is called a
synergy, as the total information explained exceeds the sum of the marginal informations
taken alone. It is appropriate to label the triple a synergy, rather than the variable $k$,
since (13) is invariant to permutation of the indices.
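A standard textbook illustration of an extreme synergy, not taken from the paper, is the exclusive-or of two independent fair bits. The sketch below assumes that construction and checks the decomposition (13) in millibits; the helper names are illustrative.

```python
import math
from itertools import product

MBIT = 2 ** 10 / math.log(2)  # nats to millibits

# X_i and X_j are independent fair bits and X_k = X_i XOR X_j (coordinates i, j, k).
pmf = {(xi, xj, xi ^ xj): 0.25 for xi, xj in product((0, 1), repeat=2)}


def entropy(pmf, keep):
    """Entropy (in nats) of the margin on the coordinates listed in keep."""
    margin = {}
    for x, p in pmf.items():
        key = tuple(x[a] for a in keep)
        margin[key] = margin.get(key, 0.0) + p
    return -sum(p * math.log(p) for p in margin.values() if p > 0)


def info(pmf, a, b):
    """inf(X_a indep X_b) = h_a + h_b - h_{a u b} for disjoint coordinate tuples."""
    return entropy(pmf, a) + entropy(pmf, b) - entropy(pmf, a + b)


def delta3(pmf):
    """Third order forward difference as the alternating sum of entropies."""
    def h(*keep):
        return entropy(pmf, keep)
    return h(0, 1, 2) - h(0, 1) - h(0, 2) - h(1, 2) + h(0) + h(1) + h(2)


I_ki = info(pmf, (2,), (0,)) * MBIT       # marginal information, zero here
I_kj = info(pmf, (2,), (1,)) * MBIT       # marginal information, zero here
I_k_ij = info(pmf, (2,), (0, 1)) * MBIT   # joint explained information
delta_ijk = delta3(pmf) * MBIT

print(round(I_ki), round(I_kj), round(I_k_ij), round(delta_ijk))
```

Neither variable alone explains anything about $X_k$, yet together they explain it completely, so the whole of the explained information sits in the negative third order difference.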


Corollary 3.1.1 Partially explained information. The explained information in one
variable expressed in terms of the marginal mutual informations of variables in $A$
adjusted for variables in $B$ is

$$\mathrm{inf}(X_k \perp\!\!\!\perp X_A \mid X_B) = \sum_{i \in A} \mathrm{inf}(X_k \perp\!\!\!\perp X_i \mid X_B) - \sum_{C \subseteq A,\ |C| > 1} \delta_{Ck|B}. \qquad (14)$$

When there are just two variables in $A$

$$\mathrm{inf}(X_k \perp\!\!\!\perp (X_i, X_j) \mid X_B) = \mathrm{inf}(X_k \perp\!\!\!\perp X_i \mid X_B) + \mathrm{inf}(X_k \perp\!\!\!\perp X_j \mid X_B) - \delta_{ijk|B}, \qquad (15)$$

the sum of the parts adjusted by the conditional 3rd-order difference.

The proof follows the previous argument and is straightforward. When $\delta_{ijk|B} < 0$ the
triple $\{i, j, k\}$ is also called a synergy, though a conditional or partial synergy is more
specific.


3.2 Suppression 


The term suppressor variable is used in regression applications where there is a contextual
asymmetry between the dependent and the explanatory variable, see the introductory
section. The suppressor variable describes a third variable which is uncorrelated
(or nearly so) with the dependent variable, but adds to the predictive power of the
regression on both; this is technically described by $\delta_{ijk} < 0$ from (13) together with
$I_{ij} = 0$. The corresponding Bayes network is displayed in Figure 1.

Figure 1: Suppression and immorality; in a suppressor regression context j is the
dependent variable, k the explanatory variable, and i the suppressor variable.


The diagram makes clear that suppression is symmetric in the sense that the variables
i and j are interchangeable. Elaboration of this condition in terms of correlations is
the content of Theorem 4.1.1 in the Applications Section.

Expressing the criterion in the more general framework of information theory extends
the idea of suppression in linear regression to variables measured on other scales with
well defined information measures, including categorical data. Examples are given in
the next section.




Expressing suppression in the general terms of information clarifies the issues when
more than two explanatory variables are involved. Recognition of a synergy in a particular
context could just reduce to calculating the conditional 3rd-order difference $\delta_{ijk|B}$.
Screening for synergies or partial synergies involves repeated calculations, as there are
as many triples as ways of choosing the two explanatory variables from the candidate set.

An alternative direction is to develop (12). For instance with three explanatory variables
and variable $k = 1$ dependent the synergy criterion becomes

$$\delta_{123} + \delta_{124} + \delta_{134} + \delta_{1234} < 0.$$

If, as well, only $\delta_{1234} < 0$, suppression is truly a function of the three explanatory
variables taken as a whole.


3.3 Immorality 


The concept of an unshielded collider, Pearl (1988), is key for understanding d-separation
in Bayes networks on acyclic directed graphs. Lauritzen and Spiegelhalter (1988) refer
to the same concept as an immorality. Bayes networks with different directions may
be probabilistically equivalent and an important result in this area, Frydenberg (1990);
Verma and Pearl (1990), is that the equivalence class is characterized by the skeleton
of the graph and its unshielded colliders. An unshielded collider is displayed in Figure
1 for three variables, where the absence of an arrow joining i and j indicates $X_i \perp\!\!\!\perp X_j$.
Consequently $I_{ij} = 0$, but $I_{ij|k} > 0$ so that $\delta_{ijk} < 0$ by (9), the same condition as for
suppression.
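A Gaussian version of this three variable collider gives a concrete value. The sketch assumes a hypothetical model in which $X_i$ and $X_j$ are independent standard Normal variables and $X_k = X_i + X_j + \varepsilon$ with unit variance noise, so that $\rho_{ij} = 0$ and $\rho_{ik} = \rho_{jk} = 1/\sqrt{3}$; it uses the standard partial correlation formula and the Gaussian mutual information $-\frac{1}{2}\log(1 - \rho^2)$.

```python
import math

MBIT = 2 ** 10 / math.log(2)  # nats to millibits

# Hypothetical collider: X_k = X_i + X_j + noise, all components unit variance.
rho_ij = 0.0
rho_ik = rho_jk = 1.0 / math.sqrt(3.0)

# Partial correlation of X_i and X_j given X_k.
rho_ij_k = (rho_ij - rho_ik * rho_jk) / math.sqrt(
    (1.0 - rho_ik ** 2) * (1.0 - rho_jk ** 2))

I_ij = -0.5 * math.log(1.0 - rho_ij ** 2) * MBIT      # marginal information
I_ij_k = -0.5 * math.log(1.0 - rho_ij_k ** 2) * MBIT  # conditional information
delta_ijk = I_ij - I_ij_k                             # third order difference

print(round(rho_ij_k, 3), round(I_ij_k, 1), round(delta_ijk, 1))
```

Here the partial correlation is $-0.5$, so conditioning on the collider induces 212.5 mbits of dependence between two marginally independent variables, and the third order difference is negative by the same amount.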


In a Bayes network with additional antecedent variables the condition for an unshielded
collider requires that k of Figure 1 is not in the separation set $A$ for which $I_{ij|A} = 0$.
This translates to

Theorem 3.3.1 Characterization of an unshielded collider. In a Bayes network the
condition

$$I_{ij|A} = 0 \quad \text{and} \quad \delta_{ijk|A} < 0 \qquad (16)$$

where $\delta_{ijk|A}$ is the 3rd-order conditional forward difference is necessary and sufficient
for an unshielded collider at k.

The proof is just a rephrasing of the definition of unshielded collider. 

This result gives an interpretation for negative conditional 3rd-order forward differences,
and suggests a method of identifying immoralities in a Bayes network.


3.4 Systematic computation of low order forward differences 

The forward differences of the entropy offer low dimensional summaries of the data.
Consider which differences to compute. For a small number of variables, all are possible,
but for moderate and large numbers this is a formidable task. Furthermore the
differences required may be different for the analysis of regression and suppression, for
contingency table analysis and collapsibility, and for direction determination in Bayes
networks.


In the analysis of a candidate graph second order differences are routinely computed as 
marginal informations for all pairs. Third order differences complement the information 
from the nested pairs and may flag suppression and immorality, which are interesting 
because they are infrequent. We propose these are computed for a subset of the triples 
of a given graph. Fourth order differences show changes in third order differences and
hence may flag conditional synergies; we suggest these are only computed in relation 
to specific triples of interest. 

Requirements: The subset of triples is required to cover the graph without unnecessary
computing. In particular previously computed differences should not be recomputed,
nor should redundant ones that have an a priori zero value with respect to the graph.
Because higher order conditional mutual informations are additive in lower order
differences, see (8), it is desirable to require nested subsets, so that for example if a fourth
order difference is computed its corresponding lower order differences are available.

Redundancy: For a given conditional independence graph certain forward differences
are either identically zero, or reduce to linear combinations of lower order differences.
For instance if i and j belong to separate connected components of the graph $\delta_A = 0$
whenever $A$ includes both i and j. Consequently when a putative graph describing the
dependencies in the data is given, not all forward differences are interesting. Note that
$\delta_{ijk} = 0$ whenever the subgraph of the triple is not connected. The proof is straightforward:
the subgraph is not connected when $\mathrm{inf}(X_k \perp\!\!\!\perp (X_i, X_j)) = 0$. Consequently both
$I_{ik} = 0$ and $I_{jk} = 0$ and so $\delta_{ijk} = 0$ by (13). Only connected triples have interesting
forward differences. Restricting attention to subsets that have complete subgraphs with
respect to a given graph satisfies the nesting criterion, but would disallow computation
of a third order difference on the chain in Figure [U 

Node clusters: We suggest that a subset of nodes in which one node has an edge to every 
other node is a configuration for which it is appropriate to compute forward differences 
of an order up to the subset size. Node clusters of this form have an approximate 
nesting structure: all but one subsets of the cluster are clusters themselves, and if any 
one edge is dropped from the cluster, it leaves a cluster. The complete subgraph on 
any number of nodes is a node cluster where any one of the vertices may take the role 
of the cluster node. Certain configurations are eliminated, for instance, a chain or a 
chordless cycle on four variables. A subset of size 3 forms a node cluster if one node is 
adjacent to both others. 
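The size 3 case of this definition is easy to enumerate. The following is only a sketch of that one criterion (a triple in which some node is adjacent to both others), not the algorithm of the Supplementary Material; the function name and the example graph are hypothetical.

```python
from itertools import combinations


def cluster_triples(nodes, edges):
    """List triples in which at least one node is adjacent to both others.

    Returns (triple, centres) pairs, where centres are the nodes of the triple
    that can play the role of the cluster node.
    """
    adj = {v: set() for v in nodes}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    found = []
    for triple in combinations(sorted(nodes), 3):
        centres = [v for v in triple
                   if all(u in adj[v] for u in triple if u != v)]
        if centres:
            found.append((triple, centres))
    return found


# Hypothetical chain 1 - 2 - 3 - 4: only the two connected triples qualify.
chain = cluster_triples([1, 2, 3, 4], [(1, 2), (2, 3), (3, 4)])
print(chain)

# On a complete triangle every vertex can take the role of the cluster node.
triangle = cluster_triples([1, 2, 3], [(1, 2), (1, 3), (2, 3)])
print(triangle)
```

Triples such as (1, 2, 4) on the chain are skipped because their subgraph is not connected, which matches the redundancy argument that such third order differences are zero with respect to the graph.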

An algorithm based on this concept is included in the Supplementary Material. 


Collider colouring of the synergies: Synergies are infrequent and so of interest. They are 
a property of a triple and so more difficult to portray than a node. However the node 
opposite the weakest edge may be singled out as a collider, generalising the term used 
in Bayes networks, Pearl (1988). When the weakest edge has zero mutual information,
so that its nodes are marginally independent, then the resulting configuration is an 
unshielded collider (immorality) and the two notions coincide. 

The collider may be indicated by a colour (red, say) and the other two nodes yellow.
Colouring the two edges adjacent to the collider indicates the other elements of the
triple. Additional rules are needed for overlapping synergies; for instance, any node 
tagged as a collider is overprinted as a collider. This can lose some detail of the synergy 
in the graph. 


4 Applications of forward differences 


We give some low dimensional examples of forward differences of the entropy, both 
theoretical and empirical, for categorical and continuous data. In three dimensions 
the third order differences quantify the difference in mutual information between two 
variables with and without conditioning on a third. We compute and display these 
differences from some known standard models numerically and, where possible, give an 
analytic condition for a synergy. A difference is measured in millibits, the same units 
that measure entropy. For continuous data, we elaborate the conditions for suppression 
for a theoretical variance matrix with a known graph structure, and give some simple 
examples. For categorical data we illustrate synergy with examples of binary data in 
three dimensions, and relate these to the issue of collapsibility. We elucidate examples
of four dimensional continuous models that are interesting in the context of Bayes
networks.

Higher dimensional examples discuss 3rd-order forward differences and synergies using 
the skeleton of the Bayes network, known or postulated to have generated the data. 
Firstly from an artificial tree averaging process, which establishes why the skeleton 
rather than the moral graph is the right graph to determine which differences need 
to be computed. Secondly the real-life example of wine quality data is analysed and 
the synergies suggest that a chain graph model might represent the structure of the 
variables well. The analysis of the carcass data leads to similar conclusions, but is 
included because it is easily accessible through R. 


4.1 Three dimensional correlations 

The lower off-diagonal elements of the correlation matrix $\Sigma$ are $\rho_{12}, \rho_{13}, \rho_{23}$, constrained
by requiring $\Sigma$ to be positive definite. The Gaussian entropy function is given in the
preliminary remarks to Section 2. The power set has $2^3$ elements; the entropies of the
singleton sets are standardised to zero and all others are negative. The 2nd-order
forward differences are negatives of the marginal mutual informations, so that for the
information against the independence of $X_1$ and $X_2$ we have $\delta_{12} = -I_{12} = \log(1 - \rho_{12}^2)/2$.

Theorem 4.1.1 Synergy with three Gaussian variables. The 3rd-order forward difference
is

$$\delta_{123} = \frac{1}{2} \log \frac{1 - \rho_{12}^2 - \rho_{13}^2 - \rho_{23}^2 + 2\rho_{12}\rho_{13}\rho_{23}}{(1 - \rho_{12}^2)(1 - \rho_{13}^2)(1 - \rho_{23}^2)} \qquad (17)$$

$$= \frac{1}{2} \log \frac{1 - \rho_{12|3}^2}{1 - \rho_{12}^2}, \qquad (18)$$

and the condition for a synergy, $\delta_{123} < 0$, is that one marginal correlation coefficient is
smaller than its corresponding partial in absolute value, for instance $|\rho_{12}| < |\rho_{12|3}|$.

The proof is in the Appendix. 

Corollary 4.1.1 Synergy and negative correlation. A synergy occurs whenever exactly 
one marginal correlation is negative. 

There are only two cases of correlation matrix to consider: one where all coefficients
are positive and the other where exactly one is negative. The corollary deals with
the second. It follows because if one correlation is negative then the corresponding
partial, say $\rho_{12|3} = (\rho_{12} - \rho_{13}\rho_{23})\{(1 - \rho_{13}^2)(1 - \rho_{23}^2)\}^{-1/2}$, exceeds the marginal in terms
of absolute value since the numerator is inflated and the denominator deflated.

In regression scenarios the condition that one marginal correlation is negative may
be subdivided by whether the correlation is between a response and an explanatory
variable, or between two explanatory variables. This corresponds to the classification of
suppression into type: negative or reciprocal, occurring in the literature on suppression
and briefly reviewed in the Introduction. The special case $\rho_{12} = 0$ corresponds to
classical suppression.

Of interest to us is that a synergy does not occur when $\rho_{12|3} = 0$, and the inequality
condition $|\rho_{12}| < |\rho_{12|3}|$ is invariant to permuting indices.

Example 1. (Numerical): For a numerical illustration the forward differences are displayed
using $\Sigma$ specified by its lower triangle 1.0, 0.2, 1.0, 0.7, 0.5, 1.0, and for comparison,
of the same $\Sigma$ with 0.2 replaced by $-0.2$. The forward differences are, respectively,

subset                            ∅    1    2    3    12        13        23        123
fwd.diff ($\rho_{12} = 0.2$)      0    0    0    0    -30.15    -497.4    -212.5    -14.63
fwd.diff ($\rho_{12} = -0.2$)     0    0    0    0    -30.15    -497.4    -212.5    -1126.0

The values are reported in millibits, see the preliminaries to Section 2. In both the
information against $X_1 \perp\!\!\!\perp X_2$ is 30.15 mbits. In the first instance the 3rd-order difference $\delta_{123}$
is $-14.63$ mbits so that the information against $X_1 \perp\!\!\!\perp X_2 \mid X_3$ is 30.15 + 14.63 = 44.78 mbits.
In the second instance $\delta_{123} = -1126$ mbits, indicating a much more substantial synergy.
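The figures in Example 1 can be reproduced from the Gaussian entropy $h_A = \frac{1}{2}\log\det(\Sigma_{AA})$ and the Möbius inversion of Section 2.2. The sketch below is one possible way to organise the calculation; the helper names and the zero-based tuple keys are assumptions, and the small determinant routine is included only to keep the block free of external libraries.

```python
import math
from itertools import combinations

MBIT = 2 ** 10 / math.log(2)  # nats to millibits


def det(m):
    """Determinant by Gaussian elimination with partial pivoting (empty matrix: 1)."""
    m = [row[:] for row in m]
    n, d = len(m), 1.0
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        if m[p][c] == 0.0:
            return 0.0
        if p != c:
            m[c], m[p] = m[p], m[c]
            d = -d
        d *= m[c][c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n):
                m[r][k] -= f * m[c][k]
    return d


def gaussian_forward_differences(corr):
    """Forward differences (mbits) of h_A = (1/2) log det(Sigma_AA), keyed by index tuples."""
    n = len(corr)
    subs = [s for r in range(n + 1) for s in combinations(range(n), r)]
    h = {s: 0.5 * math.log(det([[corr[a][b] for b in s] for a in s])) * MBIT
         for s in subs}
    return {A: sum((-1) ** (len(A) - len(B)) * h[B]
                   for B in subs if set(B) <= set(A))
            for A in subs}


def corr3(r12, r13, r23):
    """Three dimensional correlation matrix from its off-diagonal elements."""
    return [[1.0, r12, r13], [r12, 1.0, r23], [r13, r23, 1.0]]


d_pos = gaussian_forward_differences(corr3(0.2, 0.7, 0.5))
d_neg = gaussian_forward_differences(corr3(-0.2, 0.7, 0.5))
for key in [(0, 1), (0, 2), (1, 2), (0, 1, 2)]:
    print(key, round(d_pos[key], 2), round(d_neg[key], 2))
```

The keys `(0, 1)`, `(0, 2)`, `(1, 2)` and `(0, 1, 2)` correspond to the subsets 12, 13, 23 and 123 of the table.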

The result (18) generalises easily to give a condition for partial synergy.

Corollary 4.1.2 Partial synergy with three Gaussian variables. The 3rd-order conditional
forward difference for three Gaussian variables given a set $A$ of other such variables
is

$$\delta_{123|A} = \frac{1}{2} \log \frac{1 - \rho_{12|A3}^2}{1 - \rho_{12|A}^2} \qquad (19)$$

and the condition for a partial synergy, $\delta_{123|A} < 0$, is that one marginal correlation
coefficient is smaller than its corresponding partial in absolute value, that is $|\rho_{12|A}| < |\rho_{12|A3}|$.

4.2 Three dimensional contingency tables

While the value of the 3rd-order difference clearly flags the phenomenon of suppression 
in regression it does not give a definitive answer to non-collapsibility in three way tables. 

We consider three examples related to Simpson's paradox: the first is an archetypal
loglinear model, the second is numerical and the third is real-life. 

Example 2. (Analytic): The first example of a $2^3$-table is analytic where each margin
shows independence but the three variables are dependent. A priori the 3rd-order
forward difference must be negative.

The log-linear expansion of $p_{123}$ on $\{0, 1\}^3$ is

$$\log(\alpha) + (x_1 + x_2 + x_3) \log(\beta/\alpha) - 2(x_1 x_2 + x_1 x_3 + x_2 x_3) \log(\beta/\alpha) + 4 x_1 x_2 x_3 \log(\beta/\alpha), \qquad (20)$$

where $x_1, x_2, x_3$ take values 0, 1; and parameterised by $\alpha \in (0, 1/4)$ with $\beta = 1/4 - \alpha$.
In standard order the joint probabilities are $(\alpha, \beta, \beta, \alpha, \beta, \alpha, \alpha, \beta)$. It illustrates
non-collapsibility because every margin has equi-probability entries so that $\mathrm{inf}(X_i \perp\!\!\!\perp X_j) = 0$,
while any two variables contribute positively to the prediction of the third.

By direct evaluation, 

5i23 = -4 (a log(cc) + /3 log(/3)) - 3 log(2). 


This is zero when /3 = cc, but otherwise negative. 
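The closed form can be compared with a brute-force evaluation over the $2^3$ table. The following is a sketch under stated assumptions: entropies are in bits (so $3\log 2$ becomes 3), the cell probabilities follow the parity pattern of the standard-order vector above, and the function names are illustrative.

```python
from itertools import combinations, product
from math import log2


def joint(alpha):
    """Cell probabilities: alpha for an even number of ones, beta for an odd number."""
    beta = 0.25 - alpha
    return {x: (beta if sum(x) % 2 else alpha) for x in product((0, 1), repeat=3)}


def entropy(p, subset):
    """Shannon entropy (bits) of the margin of p on the given coordinates."""
    marg = {}
    for x, px in p.items():
        key = tuple(x[i] for i in subset)
        marg[key] = marg.get(key, 0.0) + px
    return -sum(q * log2(q) for q in marg.values() if q > 0)


def fwd_diff(p, subset):
    """Signed sum of entropies over all sub-subsets of `subset`."""
    return sum((-1) ** (len(subset) - r) * entropy(p, c)
               for r in range(len(subset) + 1)
               for c in combinations(subset, r))


def closed_form(alpha):
    """The direct evaluation quoted in the text, with logs to base 2."""
    beta = 0.25 - alpha
    return -4 * (alpha * log2(alpha) + beta * log2(beta)) - 3


for alpha in (0.05, 0.10, 0.125):
    p = joint(alpha)
    print(alpha, round(fwd_diff(p, (0, 1)), 6), round(fwd_diff(p, (0, 1, 2)), 6),
          round(closed_form(alpha), 6))
```

Each 2nd-order difference vanishes (the two-way margins are uniform), while the 3rd-order difference is negative except at $\alpha = \beta = 1/8$.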


Example 3. (Kidney stones): This is taken from Julious and Mullee (1994) and has previously been used as a real-life instance of Simpson's paradox. There are two factors (Treatment, Size), each with two levels (A/B, small/large stones respectively). Outcomes (81/87, 234/270, 192/263, 55/80) are recorded as the success/total count in the four groups, in Treatment within Size order.

The entropy function and its forward differences are displayed here 


subset               $\emptyset$   O       T        S        OT       OS       TS       OTS
entropy              0.0           733.4   1024.0   1023.7   1754.9   1725.8   1835.3   2533.7
fwd.diff. $\delta$   0             733.4   1024.0   1023.7   -2.443   -31.31   -212.4   -1.198


with values in millibits. The T margin is exactly balanced (1024 mbits is the maximum),
and the S margin almost so, but the T x S table is not (the mutual information is
212.4 mbits and far from zero). The value of $\delta_{OTS} = -1.198$ mbits is negative; it is also
negligible so that marginal and conditional independence measures are approximately
the same. The independence graph approximating these data is



Here Simpson's paradox occurs when comparing the OT interaction conditionally on
S with its value marginalised over S, and arises because of the large imbalance in the
T x S table. The value of $\delta_{OTS}$ does not signal the paradox.

It is easy to construct examples where the paradox (the log odds ratios in the marginal and
in the conditional tables are of opposite sign) goes with a negative third order difference, and examples where it goes with
a positive third order difference.
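The entropies and forward differences of Example 3 can be recomputed from the eight cell counts. This is a sketch rather than the original computation: the failure counts are derived as total minus successes, plug-in (empirical) entropies are used, and the 1024 mbits-per-bit scale is an assumption that matches the remark that a balanced binary margin has 1024 mbits.

```python
from itertools import combinations
from math import log2

MBITS = 1024  # assumed scale: 1 bit = 1024 mbits

# (outcome, treatment, size) -> count, from 81/87, 234/270, 192/263, 55/80
counts = {
    ("success", "A", "small"): 81, ("failure", "A", "small"): 87 - 81,
    ("success", "B", "small"): 234, ("failure", "B", "small"): 270 - 234,
    ("success", "A", "large"): 192, ("failure", "A", "large"): 263 - 192,
    ("success", "B", "large"): 55, ("failure", "B", "large"): 80 - 55,
}
O, T, S = 0, 1, 2  # positions of Outcome, Treatment, Size in each key
n = sum(counts.values())


def entropy(subset):
    """Plug-in entropy (mbits) of the margin on the given factors."""
    marg = {}
    for cell, c in counts.items():
        key = tuple(cell[i] for i in subset)
        marg[key] = marg.get(key, 0) + c
    return -sum(c / n * log2(c / n) for c in marg.values() if c > 0) * MBITS


def fwd_diff(subset):
    """Signed sum of entropies over all sub-subsets of `subset`."""
    return sum((-1) ** (len(subset) - r) * entropy(c)
               for r in range(len(subset) + 1)
               for c in combinations(subset, r))


for name, s in [("O", (O,)), ("T", (T,)), ("S", (S,)), ("OT", (O, T)),
                ("OS", (O, S)), ("TS", (T, S)), ("OTS", (O, T, S))]:
    print(name, round(entropy(s), 1), round(fwd_diff(s), 3))
```

The large T x S difference and the small three-way difference are visible directly in the printed rows.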


4.3 Four dimensional correlation matrices 

Example 4. (Analytic): The forward differences of the entropy are calculated from
the theoretical correlation matrix of various four dimensional graphical models. We
compute forward differences of all orders, though report only the most salient features to
illustrate what may be expected if data are generated from such models. The graphical
models, characterized by the graphs in Figure 2, include the so-called cluster model,
a chain, a decomposable model, the 4-cycle, and Bayes networks with one, two and three
unshielded colliders.

The interpretation of their differences derives from the separation properties of the
graph translating to a statement of the form $\delta_{a \cup b | C} = 0$ in Theorem 2.4.2; this in
turn leads to one or more linear relationships using Theorem 2.4.1. These results are
summarised in Table 1.

(a) Cluster: This node cluster is a sparse configuration sufficiently complex that $\delta_{1234}$ is not zero. The variables $X_1, X_3, X_4$ are mutually independent given the cluster node $X_2$; consequently three of the four 3rd-order differences, those involving the cluster node $X_2$, are positive as the information conditioned on $X_2$ is zero. The term $\delta_{134}$ is necessarily positive, for instance, because $X_3 \perp\!\!\!\perp X_4 \mid X_2$, and $X_1$ is a predictor of $X_2$, so that $I_{34} > I_{34|1} > I_{34|2} = 0$ (or equivalently $\delta_{34} < \delta_{34|1} < \delta_{34|2} = 0$). As $0 = \delta_{134|2} = \delta_{134} + \delta_{1234}$, $\delta_{1234}$ is always negative.

(b) Chain: $X_1 \perp\!\!\!\perp X_3 \mid X_2$ implies $\delta_{123} > 0$; similarly all other triples have a positive forward difference. That $X_1 \perp\!\!\!\perp X_{34} \mid X_2$ implies $\delta_{134|2} = 0$; this, together with the identity $\delta_{134|2} = \delta_{134} + \delta_{1234}$ involving the fourth order difference, implies $\delta_{1234} < 0$. The dependence structure of this graph is characterized by the values of $\{\delta_{12}, \delta_{23}, \delta_{34}, \delta_{123}, \delta_{234}\}$.

(c) Decomposable: $X_1 \perp\!\!\!\perp X_3 \mid X_2$ implies $\delta_{123} > 0$, similarly $\delta_{124} > 0$. Because $\delta_{134|2} = 0 = \delta_{134} + \delta_{1234}$, these two terms are of opposite sign, but otherwise arbitrary.

(d) 4-cycle: There are two independences leading to two zero linear combinations: $X_1 \perp\!\!\!\perp X_3 \mid X_{24}$ translates to $\delta_{13|24} = 0 = \delta_{13} + \delta_{123} + \delta_{134} + \delta_{1234}$, and $X_2 \perp\!\!\!\perp X_4 \mid X_{13}$ translates to $\delta_{24|13} = 0 = \delta_{24} + \delta_{124} + \delta_{234} + \delta_{1234}$. We argue that $\delta_{13} < \delta_{13|2} < \delta_{13|24} = 0$ because the information decreases as the conditioning set is enlarged. Consequently $\delta_{123} > 0$, and symmetry shows the other 3rd-order differences are positive. Also $0 < \delta_{134|2} = \delta_{134} + \delta_{1234}$, so that $\delta_{1234} < 0$.

(e) bayesNetA: There are two independences manifest: firstly $X_2 \perp\!\!\!\perp X_4 \mid X_1$ translates to $\delta_{24|1} = 0 = \delta_{24} + \delta_{124}$; secondly $X_1 \perp\!\!\!\perp X_3 \mid X_{24}$ translates to $\delta_{13|24} = 0 = \delta_{13} + \delta_{123} + \delta_{134} + \delta_{1234}$. The 3rd-order difference $\delta_{124}$ is positive and there is a partial synergy at 3 as $\delta_{234|1} < 0$.

(f) bayesNetB: There are two independences: $X_2 \perp\!\!\!\perp X_4$ implies $\delta_{24} = 0$; secondly $X_1 \perp\!\!\!\perp X_3 \mid X_{24}$ is again $\delta_{13|24} = 0 = \delta_{13} + \delta_{123} + \delta_{134} + \delta_{1234}$. There are two marginal synergies, at 1 and at 3, so $\delta_{124} < 0$ and $\delta_{234} < 0$.

(g) bayesNetC: There are three marginal independences between pairs of $X_1, X_3, X_4$ with corresponding 2nd-order differences being zero; and as these variables are mutually independent the 3rd-order difference $\delta_{134}$ is zero too. There are three marginal synergies at 2, with $\delta_{123} < 0$, $\delta_{124} < 0$ and $\delta_{234} < 0$.

[Figure 2 panels: (a) cluster, (b) chain, (c) decomp, (d) 4-cycle, (e) bayesNetA, (f) bayesNetB, (g) bayesNetC]

Figure 2: Four dimensional configurations of independence graphs (undirected and
directed).

Table 1: Summary of four dimensional forward differences for examples in Figure 2.

(a) cluster     $\delta_{13} + \delta_{123} = 0$, $\delta_{14} + \delta_{124} = 0$, $\delta_{34} + \delta_{234} = 0$,
                $\delta_{123} > 0$, $\delta_{124} > 0$, $\delta_{234} > 0$, $\delta_{134} > 0$,
                $\delta_{134} + \delta_{1234} = 0$, $\delta_{1234} < 0$.

(b) chain       $\delta_{13} + \delta_{123} = 0$, $\delta_{14} + \delta_{124} = 0$, $\delta_{14} + \delta_{134} = 0$, $\delta_{24} + \delta_{234} = 0$,
                $\delta_{123} > 0$, $\delta_{124} > 0$, $\delta_{134} > 0$, $\delta_{234} > 0$,
                $\delta_{14} + \delta_{124} + \delta_{134} + \delta_{1234} = 0$, $\delta_{1234} < 0$.

(c) decomp      $\delta_{13} + \delta_{123} = 0$, $\delta_{14} + \delta_{124} = 0$,
                $\delta_{123} > 0$, $\delta_{124} > 0$,
                $\delta_{134} + \delta_{1234} = 0$, $\delta_{1234}$ arbitrary.

(d) 4-cycle     $\delta_{13} + \delta_{123} + \delta_{134} + \delta_{1234} = 0$, $\delta_{24} + \delta_{124} + \delta_{234} + \delta_{1234} = 0$,
                $\delta_{123|4} > 0$, $\delta_{124|3} > 0$, $\delta_{134|2} > 0$, $\delta_{234|1} > 0$.

(e) bayesNetA   $\delta_{24} + \delta_{124} = 0$, $\delta_{13} + \delta_{123} + \delta_{134} + \delta_{1234} = 0$,
                $\delta_{124} > 0$, $\delta_{234|1} < 0$, other $\delta$s arbitrary.

(f) bayesNetB   $\delta_{24} = 0$, $\delta_{13} + \delta_{123} + \delta_{134} + \delta_{1234} = 0$,
                $\delta_{124} < 0$, $\delta_{234} < 0$, other $\delta$s arbitrary.

(g) bayesNetC   $\delta_{13} = 0$, $\delta_{14} = 0$, $\delta_{34} = 0$, $\delta_{134} = 0$,
                $\delta_{123} < 0$, $\delta_{124} < 0$, $\delta_{234} < 0$.
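The sign pattern and linear relations claimed for the chain (b) can be checked numerically. The sketch below assumes a Gaussian chain $X_1 - X_2 - X_3 - X_4$ whose non-adjacent correlations are products of the edge correlations; the edge values 0.6, 0.5, 0.4 are arbitrary illustrative choices, entropies are in bits with constants dropped, and the helper names are hypothetical.

```python
from itertools import combinations
from math import log2


def det(m):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [list(row) for row in m]
    n, d = len(a), 1.0
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(a[r][c]))
        if a[p][c] == 0.0:
            return 0.0
        if p != c:
            a[c], a[p] = a[p], a[c]
            d = -d
        d *= a[c][c]
        for r in range(c + 1, n):
            f = a[r][c] / a[c][c]
            for k in range(c, n):
                a[r][k] -= f * a[c][k]
    return d


def entropy(subset, corr):
    """Gaussian entropy of a margin in bits, additive constants dropped."""
    return 0.5 * log2(det([[corr[i][j] for j in subset] for i in subset]))


def fwd_diff(subset, corr):
    """Signed sum of entropies over all sub-subsets of `subset`."""
    return sum((-1) ** (len(subset) - r) * entropy(c, corr)
               for r in range(len(subset) + 1)
               for c in combinations(subset, r))


a, b, c = 0.6, 0.5, 0.4  # illustrative edge correlations for 1-2, 2-3, 3-4
R = [[1.0, a, a * b, a * b * c],
     [a, 1.0, b, b * c],
     [a * b, b, 1.0, c],
     [a * b * c, b * c, c, 1.0]]

d = {s: fwd_diff(s, R) for r in (2, 3, 4) for s in combinations(range(4), r)}
print("delta_13 + delta_123  =", round(d[(0, 2)] + d[(0, 1, 2)], 12))
print("delta_134 + delta_1234 =", round(d[(0, 2, 3)] + d[(0, 1, 2, 3)], 12))
print("3rd-order:", {s: round(v, 4) for s, v in d.items() if len(s) == 3})
print("4th-order:", round(d[(0, 1, 2, 3)], 4))
```

Indices 0 to 3 stand for variables 1 to 4; the two printed sums vanish up to rounding, all triples are positive and the fourth order term is negative.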


Example 5. (GP burn-out): This example was used by Maassen and Bakker (2001) to
illustrate suppression in the context of path analysis. We use it to illustrate forward
differences of the entropy in four dimensions. Surprisingly we find that there are no
synergies in any of the three dimensional margins nor any partial synergy in four
dimensions, and consequently no colliders.


A two wave study of burnout among 207 general practitioners measured levels of
the lack of job satisfaction and of burn-out. The variables here are denoted by
js1, js2, bo1, bo2, with the numeral denoting the wave. The correlation matrix, reported
in supplementary material, shows all marginal correlations to be positive. Forward
differences of the entropy higher than the first are


subset    2nd-order fwd.diff    subset             3,4-orders fwd.diff
js1:bo1   -191.6                js1:bo1:js2        66.96
js1:js2   -98.9                 js1:bo1:bo2        103.7
bo1:js2   -123.9                js1:js2:bo2        71.63
js1:bo2   -114.5                bo1:js2:bo2        118.5
bo1:bo2   -356.9
js2:bo2   -259.2                js1:bo1:js2:bo2    -61.96


The 2nd-order differences (pairwise MIs) are all substantial; the 3rd-order differences
are all positive, so clearly there are no synergies in any three dimensional margin.
There are two (approximate) linear relations corresponding to the 2nd-order statements.
The first is js1 $\perp\!\!\!\perp$ bo2 | {bo1, js2}:

$$\delta_{js1:bo2} + \delta_{js1:js2:bo2} + \delta_{js1:bo1:bo2} + \delta_{js1:bo1:js2:bo2} = -1.13 \text{ mbits},$$

and the second is bo1 $\perp\!\!\!\perp$ js2 | {js1, bo2}:

$$\delta_{bo1:js2} + \delta_{js1:bo1:js2} + \delta_{bo1:js2:bo2} + \delta_{js1:bo1:js2:bo2} = -0.40 \text{ mbits}.$$

This suggests the 4-cycle with edges js1-bo1, bo1-bo2, bo2-js2 and js2-js1.

Standard model fitting using the R-packages pcalg, gRim, or ggm gives the same
independence graph.

The context suggests that synergies might be found at one or both of the second wave
nodes: for js2, $\delta_{js1:js2:bo2|bo1} = \delta_{js1:js2:bo2} + \delta_{js1:bo1:js2:bo2} = 9.67$ mbits and for bo2,
$\delta_{bo1:js2:bo2|js1} = \delta_{bo1:js2:bo2} + \delta_{js1:bo1:js2:bo2} = 56.55$ mbits. However both are positive, indicating that this
is not the case, and we conclude there are no suppression effects manifest in the observed
data.
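The table and the two near-zero linear relations of Example 5 can be reproduced from the correlation matrix given in the supplementary material. This sketch makes several assumptions: the supplementary labels SAT11, BO11, SAT12, BO12 are taken to correspond to js1, bo1, js2, bo2; the scale is 1024 mbits per bit; and, because the published correlations are rounded to three decimals, only approximate agreement with the table is expected.

```python
from itertools import combinations
from math import log2

MBITS = 1024  # assumed scale: 1 bit = 1024 mbits
NAMES = ["js1", "bo1", "js2", "bo2"]  # assumed mapping of SAT11, BO11, SAT12, BO12
R = [[1.000, 0.478, 0.354, 0.379],
     [0.478, 1.000, 0.393, 0.619],
     [0.354, 0.393, 1.000, 0.544],
     [0.379, 0.619, 0.544, 1.000]]


def det(m):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [list(row) for row in m]
    n, d = len(a), 1.0
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(a[r][c]))
        if a[p][c] == 0.0:
            return 0.0
        if p != c:
            a[c], a[p] = a[p], a[c]
            d = -d
        d *= a[c][c]
        for r in range(c + 1, n):
            f = a[r][c] / a[c][c]
            for k in range(c, n):
                a[r][k] -= f * a[c][k]
    return d


def entropy(subset):
    """Gaussian entropy of a margin in mbits, additive constants dropped."""
    return 0.5 * log2(det([[R[i][j] for j in subset] for i in subset])) * MBITS


def fwd_diff(subset):
    """Signed sum of entropies over all sub-subsets of `subset`."""
    return sum((-1) ** (len(subset) - r) * entropy(c)
               for r in range(len(subset) + 1)
               for c in combinations(subset, r))


def cond_diff(pair, given):
    """Conditional 2nd-order difference as the sum of delta_B, pair in B."""
    return sum(fwd_diff(tuple(pair) + c)
               for r in range(len(given) + 1)
               for c in combinations(given, r))


for r in (2, 3, 4):
    for s in combinations(range(4), r):
        print(":".join(NAMES[i] for i in s), round(fwd_diff(s), 2))
print("js1, bo2 given bo1, js2:", round(cond_diff((0, 3), (1, 2)), 2))
print("bo1, js2 given js1, bo2:", round(cond_diff((1, 2), (0, 3)), 2))
```

The two conditional differences are small and non-positive, which is what the missing edges js1-bo2 and bo1-js2 of the 4-cycle require.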


4.4 Higher dimensions 

Example 6. (A tree averaging structure): This artificial tree averaging process provides
an example of computing third order forward differences with respect to a given graph.
The process starts with a founding generation of independent Gaussian random variables.
Pairs of these are parents to a single child, giving a new generation of half the
size; and the process repeats until only one successor is left. The parent-child relation
is specified by the parameter $\alpha$ in

$$X_{child} = \alpha (X_{par1} + X_{par2}) + \varepsilon$$

where the $\varepsilon$ are independent standard Normal. The correlation matrix is determined by
the parameter $\alpha$. With 8 founders there are $p = 15$ variables, so that in principle there
are 455 subsets of size 3 to examine.

The Bayes network generating the process is displayed in Figure 3.



Figure 3: The Bayes network generating the tree averaging process. 


Emulating a data processing exercise with observations on this process would lead to 
the skeleton with 19 triples or to the moral graph with 19 + 12 = 31 triples. Recall 
that a triple with respect to a graph, is a subset of size 3 with (at least) one node 
adjacent to the two others. 

In the moral graph there are seven synergies (negative third order forward differences)
that exactly correspond to the seven immoralities in the graph. With $\alpha = 0.6$, the
strongest synergy is at the apex of the pyramid (-164.02 mbits), followed by two in the next
tier (-116.78 mbits) and the four weaker ones at the bottom tier (-53.66 mbits). The
positive differences each correspond to a child-parent-grandparent conditional independence.
The four stronger ones (65.90 mbits) are at the apex of the tree and involve
the final survivor $X_p$; the other eight positive ones (44.06 mbits) involve a founder node.

There are exactly twelve differences that are identically zero corresponding to moralisation:
applying d-separation, for instance to the 2,3,4 triple, marginally $X_{24} \perp\!\!\!\perp X_3$, so
that both $I_{23}$ and $I_{23|4}$ are zero. In large graphs it is more efficient to compute low
order forward differences from the skeleton rather than from the moralised graph of a
Bayes network.
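The values quoted for the tree averaging process can be reproduced from its covariance matrix. The sketch below rests on assumptions that are not spelled out in the text: founders have unit variance, the node numbering (founders 0 to 7, then 8 to 11, then 12 and 13, apex 14) is chosen here for convenience and need not match Figure 3, and the scale is 1024 mbits per bit.

```python
from math import comb, log2

ALPHA = 0.6
# child -> (parent, parent); hypothetical numbering, founders are nodes 0..7
PARENTS = {8: (0, 1), 9: (2, 3), 10: (4, 5), 11: (6, 7),
           12: (8, 9), 13: (10, 11), 14: (12, 13)}
P = 15

# Covariance built generation by generation from X_child = ALPHA*(par1 + par2) + eps.
cov = [[0.0] * P for _ in range(P)]
for f in range(8):
    cov[f][f] = 1.0  # assumed unit-variance founders
for child in sorted(PARENTS):
    a, b = PARENTS[child]
    for k in range(child):
        cov[child][k] = cov[k][child] = ALPHA * (cov[a][k] + cov[b][k])
    cov[child][child] = ALPHA * (cov[child][a] + cov[child][b]) + 1.0


def delta3(i, j, k):
    """3rd-order forward difference of the Gaussian entropy, in mbits."""
    def v(a):
        return cov[a][a]

    def d2(a, b):
        return v(a) * v(b) - cov[a][b] ** 2

    d3 = (v(i) * v(j) * v(k) + 2 * cov[i][j] * cov[i][k] * cov[j][k]
          - v(i) * cov[j][k] ** 2 - v(j) * cov[i][k] ** 2 - v(k) * cov[i][j] ** 2)
    bits = 0.5 * (log2(d3) - log2(d2(i, j)) - log2(d2(i, k)) - log2(d2(j, k))
                  + log2(v(i)) + log2(v(j)) + log2(v(k)))
    return 1024 * bits


print("subsets of size 3:", comb(P, 3))
for child, (a, b) in sorted(PARENTS.items()):
    print("immorality", (a, b, child), round(delta3(a, b, child), 2))
print("founder-child-grandchild", round(delta3(0, 8, 12), 2))
print("chain ending at the apex", round(delta3(8, 12, 14), 2))
print("moralisation triple", round(delta3(0, 8, 9), 6))
```

Each immorality gives a negative difference whose size grows towards the apex, each directed chain of three generations gives a positive one, and a triple created only by moralisation gives zero.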


Example 7. (Carcass data): A well known data set is the so-called carcass data available
from the R-package gRim, Højsgaard et al. (2012); the correlation matrix is reproduced
in supplementary material. It consists of 7 nutritional content measurements on 374
pigs(?). The skeleton, found using the pcalg R-package, Kalisch et al. (2012), with
standard settings and a 5% significance level for edge testing, is shown as the left
diagram in Figure 4. With this graph there are exactly nine node
clusters of order 3 (triples). The corresponding forward differences are listed in
Table 2. Most of the entries are positive, and quite a few are large, indicating large duplication
of effects, especially within the Fat measures and within the Meat measures. Strikingly
there are two overlapping synergies (the negative differences). They have the
same collider LeanMeat, and the nodes of the synergies are coloured in the graph on
the right using the colouring rule above. Reading from the graph, Fat11 and Meat13
are marginally independent and together enhance LeanMeat more than their separate
effects would warrant. The same is true of the effect of Fat12 and Meat13 on LeanMeat.
Both of these synergies suggest that the data be modelled as a chain graph,
Wermuth and Lauritzen (1990), with LeanMeat as the single outcome variable.


Table 2: Third order forward differences for the carcass data based on the graph.

Nodes                              $\delta$
Fat11    Meat13   LeanMeat         -78.13
Fat12    Meat13   LeanMeat         -76.55
Meat12   Meat13   LeanMeat          32.32
Meat11   Meat13   LeanMeat          50.54
Fat12    Fat13    LeanMeat         458.18
Fat11    Fat13    LeanMeat         460.97
Fat11    Fat12    LeanMeat         514.10
Fat11    Fat12    Fat13            694.67
Meat11   Meat12   Meat13           894.42



Figure 4: Skeleton of the carcass data (left) with two overlapping coloured synergies 
(right). 


Going from the coloured graph may be misleading without access to the corresponding
table of synergies; for instance the graph might be taken to indicate that
{Fat11, Fat12, LeanMeat} is a synergy when it is not.


Example 8. (Wine quality data): We consider a regression example of wine quality
taken from the machine learning data set repository at UCI
(archive.ics.uci.edu/ml/datasets/Wine+Quality). There are 4898 observations on 11
physico-chemical properties and a sensory quality variable for the white Portuguese
Vinho Verde wine reported by Cortez et al. (2009). The red wine data was used as one
of the test sets in Elidan (2010). The quality outcome is an ordered categorical
response, the other variables are continuous. Our objective is to find and display any
synergies in the explanatory variables, so leading to a better understanding of the data set.

An exploratory analysis reveals transformations are required to establish linearity and
normality. The simple approach of taking the normal scores, based on ranking each
variable, produces pairs plots for the bivariate margins that are now almost all
uniformly ovaloid.

The skeleton is found using the pcalg R-package, Kalisch et al. (2012), with standard
settings and a 1% significance level for edge testing. There are 21 edges and 51 triples in
the skeleton compared to 55 and 165, respectively, in the complete graph. The empirical
cumulative distribution function of the corresponding 3rd-order forward differences is
displayed on the left in Figure 5. The majority of the 3rd-order differences are near
zero. There is clearly one large synergy, four others of some size, three large positive
forward differences, and five more of some size. The detail is given in Table 3.

Figure 5: Left: empirical cdf of the 3rd-order differences computed from the estimated
skeleton of the wine data; Right: the skeleton of the wine data with synergies
(< -15 mbits) coloured.

The skeleton on the right of the Figure is coloured with the five synergies in the Table
that are stronger than -15 mbits. This has the effect of classifying the explanatory
variables into red, yellow and white. Each red node belongs to one or more synergistic
triples and is defined as the node opposite the weakest edge. There are just two red
nodes: density, which occurs in four synergies, and total.sulfur.dioxide, occurring once.
Three of the four synergies including density are immoralities and make density an
unshielded collider. The context of this physico-chemical data set suggests a causal
mechanism in which density and total.sulfur.dioxide are responses directly affected by
the yellow nodes in a synergistic relation. Interestingly density and total.sulfur.dioxide
are associated, but not synergistically, occurring together in three of the largest
five positive forward differences. The white nodes are not members of any synergistic
triple. This coloured classification of nodes suggests further fitting the data as variables
in a chain graph, Wermuth and Lauritzen (1990).

Table 3: Larger synergies for the wine data, with the node to be coloured red indicated
by the asterisk.

triple                                                             3rd-order diff.
residual.sugar     density*               alcohol                  -61.74
volatile.acidity   free.sulfur.dioxide    total.sulfur.dioxide*    -25.49
fixed.acidity      residual.sugar         density*                 -23.29
fixed.acidity      density*               alcohol                  -19.64
residual.sugar     chlorides              density*                 -18.50
chlorides          total.sulfur.dioxide   alcohol                   79.46
chlorides          total.sulfur.dioxide   density                   84.40
residual.sugar     total.sulfur.dioxide   density                  139.82
total.sulfur.dioxide   density            alcohol                  158.78
chlorides          density                alcohol                  184.43


5 Discussion 


To turn forward difference estimation into a practical statistical tool requires a reliable 
method of assessing sampling errors. This is clearly necessary in empirical estimation 
though perhaps less so in the testing scenarios of graphical model search. For now 
we make two remarks. Firstly, it is probable that lower order conditional mutual 
informations have smaller sampling errors than any associated higher order measures, 
as these have fewer additional terms in their expansion. Secondly the sampling error 
of the highest order term in the forward difference expansion of information is
probably of the same order of variability as the information itself. However in the 
absence of good approximations to sampling errors the parametric bootstrap should 
work well. 

Forward differences of the entropy function promise a productive vein of research related 
to graphical models. For instance the additive expansion of the conditional mutual 
information statistic in terms of 3rd-order differences gives a particularly simple proof
of the so-called information inequality. The potential efficiency gains in graphical
model constraint based search might be leveraged to attain or surpass that of current
algorithms such as pcalg mentioned above. A difficult problem is to locate and evaluate
higher dimensional synergies of the form $\delta_{ijk|A} < 0$ where the subset $A$ is arbitrary. A
possible line of research is to investigate whether synergies for shielded colliders have a role to
play in understanding causal graphs.

Parallel to forward differences are backward differences generated by inverting the
lattice of entropies and taking $h_P$ as the minimal and $h_\emptyset$ as the maximal elements
respectively. A better way to study this might be to take the forward differences of
the conditional entropy function $h_{P \setminus A | A}$ on $\{A \subseteq P\}$. It is quickly seen that the
2nd-order differences are pair-wise mutual informations conditioned on all the other
variables, and 3rd-order differences are $\delta_{ijk|P \setminus ijk}$.

Appendix Proofs

Proof of Theorem 2.2.1 The notation $\delta_A$ is shorthand for $\delta_{(A)}$ where the round
brackets indicate an ordered sequence. We wish to show $\delta_{\pi(A)} = \delta_A$ for any permutation
$\pi$.

We argue by enumeration on $|A|$. For $|A| = 1$ there is nothing to show. For $|A| = 2$ with
$A = \{i,j\}$ say, $h_A = \delta_\emptyset + \delta_i + \delta_j + \delta_{ij}$ and $h_{\pi(A)} = \delta_\emptyset + \delta_i + \delta_j + \delta_{\pi(ij)}$. As the entropies
are equal, subtraction shows $\delta_{\pi(ij)} = \delta_{ij}$ so that the 2nd-order forward differences are
symmetric. For $|A| = 3$ a similar result is attained using the symmetry of the 2nd-order
terms. The argument continues until $|A| = p$. □

Proof of Lemma 2.1 Note that $\delta_\emptyset = 0$ but the term is included to preserve symmetry.
The forward difference expansion (3) of $h_{a \cup b}$ is

$$h_{a \cup b} = \sum_{c \subseteq a \cup b} \delta_c = -\delta_\emptyset + h_a + h_b + \sum_{c:\, |a \cap c| \geq 1,\, |b \cap c| \geq 1} \delta_c.$$

The last summation on the right is the sum over terms with at least one element from
$a$ and one from $b$. By hypothesis it is 0.

Direct enumeration on the elements $(|a|, |b|) \in \{1, 2, \ldots, |A|\} \times \{1, 2, \ldots, |B|\}$ shows
that every $\delta_c$ in this summation is 0. Start with singletons $a = \{i\}$, $b = \{j\}$. The only
term is $\delta_{ij}$ and so it is 0. Repeat this over all pairs $ij$. A similar argument applied
to $a = \{i\}$, $b = \{j, k\}$ and using $\delta_{ij} = 0$ establishes $\delta_{ijk} = 0$, for all $k$. Repeating this
argument establishes $\delta_{ib} = 0$ for any nonempty $b \subseteq B$. A similar enumeration on $|a|$
then gives the result.

The proof of the converse follows immediately from the expansion of $h_{a \cup b}$. □

Proof of Theorem 2.3.1 Take $A$ disjoint from $\{i,j\}$ and note

$$h_{Aj} = h_A + \sum_{j \in B \subseteq Aj} \delta_B. \qquad (21)$$

This additivity recurrence follows directly from the definition of the forward differences
at (5). Now use this in the elementary imset representation at (3):

$$\delta_{ij|A} = (h_{Aij} - h_{Ai}) - (h_{Aj} - h_A)$$
$$= \sum_{j \in B \subseteq Aij} \delta_B - \sum_{j \in B \subseteq Aj} \delta_B, \quad \text{using (21)},$$
$$= \sum_{ij \subseteq B \subseteq Aij} \delta_B.$$

Cancellation leaves only those terms with both $i$ and $j$ in the subscript, as required. □
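Since the argument above is purely algebraic, the identity can be checked on an arbitrary set function, not only on an entropy. The sketch below does this on four elements with random values; the names `h`, `delta` and `cond_delta` are illustrative, and the forward difference is assumed to be the alternating (Möbius) sum $\delta_S = \sum_{C \subseteq S} (-1)^{|S|-|C|} h_C$.

```python
import random
from itertools import combinations


def subsets(s):
    """All subsets of s, as frozensets."""
    s = tuple(s)
    for r in range(len(s) + 1):
        for c in combinations(s, r):
            yield frozenset(c)


P = (1, 2, 3, 4)
random.seed(0)
h = {S: random.random() for S in subsets(P)}  # arbitrary set function
h[frozenset()] = 0.0


def delta(S):
    """Forward difference: alternating sum of h over the subsets of S."""
    S = frozenset(S)
    return sum((-1) ** (len(S) - len(C)) * h[C] for C in subsets(S))


def cond_delta(i, j, A):
    """Conditional 2nd-order difference h_Aij - h_Ai - h_Aj + h_A."""
    A = frozenset(A)
    return h[A | {i, j}] - h[A | {i}] - h[A | {j}] + h[A]


i, j, A = 1, 2, (3, 4)
lhs = cond_delta(i, j, A)
rhs = sum(delta(B | {i, j}) for B in subsets(A))
print(abs(lhs - rhs) < 1e-12)
```

The same check with $A = \{3\}$ gives the special case $\delta_{12|3} = \delta_{12} + \delta_{123}$ used in Example 1.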


Proof of Theorem 2.4.1 Firstly, we show $\delta_{12|3} = \delta_{123} + \delta_{12}$ as the structure of the proof
is contained in this special case. Take the definition of a conditional forward difference

$$\delta_{12|3} = h_{12|3} - h_{1|3} - h_{2|3} + h_{\emptyset|3}$$
$$= h_{123} - h_{13} - h_{23} + h_3,$$

simplified by applying $h_{A|B} = h_{A \cup B} - h_B$ repeatedly and noting that the four $-h_3$
terms cancel. The term $\delta_{123}$ is a sum over the $2^3$ elements of the power set $\mathcal{P}(\{1, 2, 3\})$
of the signed function $h$. Partition this into the sum of those elements that contain 3
and those that do not, then

$$\delta_{123} = \sum_{C \subseteq 12} (-1)^{2+1-|C3|} h_{C3} + \sum_{C \subseteq 12} (-1)^{2+1-|C|} h_C$$
$$= \delta_{12|3} - \delta_{12},$$

taking care with the signs, and as required.

More generally consider (11): from the definition of conditional forward differences

$$\delta_{A|Bk} = \sum_{C \subseteq A} (-1)^{|A|-|C|} h_{C|Bk}$$
$$= \sum_{C \subseteq A} (-1)^{|A|-|C|} h_{CBk},$$

where the $h_{Bk}$ terms cancel. The sum for $\delta_{Ak|B}$ is partitioned into the sum over the
power sets including and excluding $k$:

$$\delta_{Ak|B} = \sum_{C \subseteq A} (-1)^{|A|+1-|Ck|} h_{CkB} + \sum_{C \subseteq A} (-1)^{|A|+1-|C|} h_{CB}$$
$$= \delta_{A|Bk} - \delta_{A|B},$$

as required. □

Proof of Theorem 2.4.2 When $C$ separates $A$ and $B$ then, as a consequence of the
Markov properties of the graph, $X_A \perp\!\!\!\perp X_B \mid X_C$; consequently in turn $h_{A \cup B|C} = h_{A|C} + h_{B|C}$.
By a small generalisation of the additivity Lemma 2.1 to incorporate conditioning,
the result follows. □


Proof of Theorem 3.1.1 Endow the set $A$ with a total ordering so that for $i \neq j \in A$
either $i < j$ or $j < i$. Apply the information identity, Cover and Thomas (2006), to get

$$\mathrm{inf}(X_k \perp\!\!\!\perp X_A) = \sum_{j \in A} \mathrm{inf}(X_k \perp\!\!\!\perp X_j \mid X_{\{i;\, i<j\}}). \qquad (22)$$

Use (8) of Theorem 2.3.1 to express the conditional mutual informations in terms of
the forward differences, so

$$\mathrm{inf}(X_k \perp\!\!\!\perp X_A) = -\sum_{j \in A} \sum_{B \subseteq \{i;\, i<j\}} \delta_{Bjk}.$$

Isolating the 2nd-order differences from the sum and rearranging the index of summation gives
the result:

$$\mathrm{inf}(X_k \perp\!\!\!\perp X_A) = -\sum_{j \in A} \delta_{jk} - \sum_{j \in A} \sum_{B \subseteq \{i;\, i<j\},\, |B| \geq 1} \delta_{Bjk}$$
$$= \sum_{j \in A} \mathrm{inf}(X_j \perp\!\!\!\perp X_k) - \sum_{B \subseteq A,\, |B| > 1} \delta_{Bk}.$$

□

Proof of Theorem 4.1.1 The expression (17) may be derived directly by evaluating the
determinants in the Gaussian entropy. The second statement follows from (9) and the
fact that $I_{12|3} = -\log(1 - \rho_{12|3}^2)/2$. □

Jan 17, 2015

A node cluster algorithm for computing forward differences 

We are given an undirected (simple) graph on $p$ nodes and wish to compute forward
differences of order $k$ or less for that graph. For $k = 3$ the difference is evaluated
if, in the graph, the node $i$, say, has two neighbours $j$ and $l$, so that this triple forms a
node cluster. More generally, a subset (of any order) is viable if there is one node that
is a neighbour to all other nodes.

Examples focus on low order differences so we adopt a breadth first computation. We
resolve orderings by choosing the weakest candidates based on the marginal mutual
information $\{I_{ij};\ i, j \in P\}$. This makes sense when adapting the algorithm to discard
edges.


Algorithm 1 A node cluster algorithm
Increment k: starting with k = 3.

• LOOP on nodes: to pass through whole graph.
    Choose node with maximum degree, not yet visited.

    • LOOP on all tuples (length k-1) of its neighbours:
        visit weakest tuple first, via sum of MIs.
        Check tuple forms a node cluster,
        put subset = (node, tuple),
        if subset is new, store.
        Evaluate the entropy of the subset, store.

        • LOOP on all sub-subsets of the subset:
            evaluate forward difference, using stored entropies.
            If a relevant sub-subset unvisited,
            compute entropy, store.

UNTIL all sub-subsets, tuples, and nodes visited.
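The enumeration step of the algorithm can be sketched in a few lines. This is a simplified illustration, not the algorithm as specified: it lists the viable subsets of a given order but ignores the ordering by degree and by mutual information, and it does not evaluate or cache entropies. The graph used is the tree averaging network of Example 6 with a hypothetical node numbering (founders 0 to 7, apex 14); the moral graph is formed by marrying the two parents of each child.

```python
from itertools import combinations


def undirected(edges):
    """Adjacency sets of an undirected simple graph."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def node_clusters(adj, k):
    """All size-k subsets in which some node is a neighbour of every other member."""
    clusters = set()
    for hub, nbrs in adj.items():
        for others in combinations(sorted(nbrs), k - 1):
            clusters.add(frozenset((hub,) + others))  # set removes duplicates
    return clusters


# child -> (parent, parent) for the tree averaging network; numbering is illustrative
PARENTS = {8: (0, 1), 9: (2, 3), 10: (4, 5), 11: (6, 7),
           12: (8, 9), 13: (10, 11), 14: (12, 13)}
skeleton_edges = [(p, c) for c, ps in PARENTS.items() for p in ps]
moral_edges = skeleton_edges + list(PARENTS.values())  # marry the parents

print(len(node_clusters(undirected(skeleton_edges), 3)))  # 19
print(len(node_clusters(undirected(moral_edges), 3)))     # 31
```

Storing clusters as frozensets means a triangle, which has three possible hubs, is counted once; this is what reduces the moral graph count to 31.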
Correlation matrices


GP burn-out data

        SAT11   BO11    SAT12   BO12
SAT11   1.000   0.478   0.354   0.379
BO11    0.478   1.000   0.393   0.619
SAT12   0.354   0.393   1.000   0.544
BO12    0.379   0.619   0.544   1.000

Wine data

# source: archive.ics.uci.edu/ml/datasets/Wine+Quality
dim(wine)
[1] 4898   12

qnrank = function(x) {
  n = length(x)
  # normal scores: standard Normal quantiles at the positions i/(n+1)
  qn = qnorm(seq_len(n) / (n + 1))
  # map each observation to the score of its rank; ties are broken at random
  return(qn[rank(x, ties.method = "random")])
}

xqn = apply(wine, 2, qnrank)
data = xqn[, 1:11]   # exclude quality as categorical
noquote(colnames(data))

 [1] fixed.acidity        volatile.acidity     citric.acid
 [4] residual.sugar       chlorides            free.sulfur.dioxide
 [7] total.sulfur.dioxide density              pH
[10] sulphates            alcohol

colnames(data) = NULL
cor(data)



           [,1]       [,2]      [,3]      [,4]      [,5]       [,6]      [,7]
 [1,]   1.00000  -0.030966   0.31742   0.09673   0.08979  -0.037524   0.10117
 [2,]  -0.03097   1.000000  -0.16975   0.10579   0.01394  -0.084837   0.11634
 [3,]   0.31742  -0.169748   1.00000   0.04799   0.04927   0.089613   0.10153
 [4,]   0.09673   0.105791   0.04799   1.00000   0.20868   0.319827   0.41631
 [5,]   0.08979   0.013938   0.04927   0.20868   1.00000   0.162736   0.35118
 [6,]  -0.03752  -0.084837   0.08961   0.31983   0.16274   1.000000   0.62336
 [7,]   0.10117   0.116342   0.10153   0.41631   0.35118   0.623356   1.00000
 [8,]   0.29980   0.002518   0.12049   0.74793   0.47804   0.299242   0.53097
 [9,]  -0.43610  -0.045856  -0.15785  -0.16324  -0.04935   0.009448   0.01014
[10,]  -0.01824  -0.034665   0.07433   0.01986   0.09323   0.068181   0.16226
[11,]  -0.12775   0.054468  -0.05880  -0.41334  -0.53105  -0.258037  -0.44020

            [,8]       [,9]     [,10]     [,11]
 [1,]   0.299804  -0.436098  -0.01824  -0.12775
 [2,]   0.002518  -0.045856  -0.03467   0.05447
 [3,]   0.120485  -0.157846   0.07433  -0.05880
 [4,]   0.747925  -0.163236   0.01986  -0.41334
 [5,]   0.478040  -0.049348   0.09323  -0.53105
 [6,]   0.299242   0.009448   0.06818  -0.25804
 [7,]   0.530971   0.010136   0.16226  -0.44020
 [8,]   1.000000  -0.097515   0.11178  -0.80751
 [9,]  -0.097515   1.000000   0.15959   0.15053
[10,]   0.111781   0.159591   1.00000  -0.03972
[11,]  -0.807506   0.150530  -0.03972   1.00000



